fix(video): fix video stream freezing on capture re-init (e.g. pipewire display switch) #5249

Merged
ReenigneArcher merged 3 commits into LizardByte:master from Kishi85:workaround-video-stream-freeze-on-display-switch
Jun 26, 2026

Conversation

@Kishi85

@Kishi85 Kishi85 commented Jun 4, 2026

Contributor

Description

This is a trial-and-error workaround for the video stream freezing on display switches. The freeze appears to be caused by a race condition that still needs to be diagnosed and might even be outside Sunshine's control; it occurs whenever the added thread sleeps are absent. Adding a 0.01ms sleep at the relevant code points (see changes) seems to fix the issue reliably without creating unwanted behaviour or other issues.

This also raises the log level for pipewire state changes, to ease initial analysis of future issues without having debug logging enabled.

This issue was identified after working around the FFmpeg segfault in #4943 but might occur independently from that. Unsure if this can also help with #5241 but should be tested as it might be the same root cause.

Note: This issue is almost impossible to pinpoint as it only occurs when running as a systemd user service without debug logging enabled (so BOOST_LOG(debug) is a no-op). If run with debug logging enabled, the runtime added by BOOST_LOG is sufficient to alleviate the issue. The same is true when running with a debugger attached. The only way to find the relevant code segment was to change the log level of each log statement in pipewire.cpp and then try to trigger the issue. After doing that, the following code segment was found to fix the issue when logged at level info:

if (!push_captured_image_cb(std::move(img_out), true)) {
  BOOST_LOG(debug) << "[pipewire] PipeWire: !push_captured_image_cb -> ok";
  return platf::capture_e::ok;
}

That is why the workaround now adds a sleep right before the return (and does so also for the timeout case).

Screenshot

Issues Fixed or Closed

Roadmap Issues

Type of Change

  • feat: New feature (non-breaking change which adds functionality)
  • fix: Bug fix (non-breaking change which fixes an issue)
  • docs: Documentation only changes
  • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semicolons, etc.)
  • refactor: Code change that neither fixes a bug nor adds a feature
  • perf: Code change that improves performance
  • test: Adding missing tests or correcting existing tests
  • build: Changes that affect the build system or external dependencies
  • ci: Changes to CI configuration files and scripts
  • chore: Other changes that don't modify src or test files
  • revert: Reverts a previous commit
  • BREAKING CHANGE: Introduces a breaking change (can be combined with any type above)

Checklist

  • Code follows the style guidelines of this project
  • Code has been self-reviewed
  • Code has been commented, particularly in hard-to-understand areas
  • Code docstring/documentation-blocks for new or existing methods/components have been added or updated
  • Unit tests have been added or updated for any new or modified functionality

AI Usage

  • None: No AI tools were used in creating this PR
  • Light: AI provided minor assistance (formatting, simple suggestions)
  • Moderate: AI helped with code generation or debugging specific parts
  • Heavy: AI generated most or all of the code changes

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch 2 times, most recently from b29d83a to d633d52 Compare June 4, 2026 14:28
@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch 4 times, most recently from bb8e6da to 1003b5d Compare June 5, 2026 14:08
@ReenigneArcher ReenigneArcher force-pushed the workaround-video-stream-freeze-on-display-switch branch from 1003b5d to f8bb8ae Compare June 5, 2026 17:06
@codecov

codecov Bot commented Jun 5, 2026


Bundle Report

Bundle size has no change ✅

@codecov

codecov Bot commented Jun 5, 2026


Codecov Report

❌ Patch coverage is 5.00000% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.54%. Comparing base (1cde8fe) to head (1c97c44).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/platform/linux/pipewire.cpp 0.00% 11 Missing ⚠️
src/thread_safe.h 12.50% 7 Missing ⚠️
src/video.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5249      +/-   ##
==========================================
- Coverage   27.55%   27.54%   -0.01%     
==========================================
  Files         113      113              
  Lines       25589    25596       +7     
  Branches    11238    11239       +1     
==========================================
+ Hits         7050     7051       +1     
+ Misses      16593    15284    -1309     
- Partials     1946     3261    +1315     
Flag Coverage Δ
Archlinux 11.49% <0.00%> (-0.01%) ⬇️
FreeBSD-amd64 13.31% <0.00%> (-0.01%) ⬇️
Homebrew-macos-14 21.01% <0.00%> (-0.02%) ⬇️
Homebrew-macos-15 21.18% <0.00%> (-0.02%) ⬇️
Homebrew-macos-26 21.31% <0.00%> (-0.02%) ⬇️
Homebrew-ubuntu-24.04 13.52% <0.00%> (-0.01%) ⬇️
Linux-AppImage 12.39% <0.00%> (-0.01%) ⬇️
Windows-AMD64 15.26% <0.00%> (-0.01%) ⬇️
Windows-ARM64 13.29% <0.00%> (-0.01%) ⬇️
macOS-arm64 19.25% <0.00%> (-0.02%) ⬇️
macOS-x86_64 18.71% <0.00%> (-0.02%) ⬇️

Files with missing lines Coverage Δ
src/video.cpp 52.73% <0.00%> (+0.02%) ⬆️
src/thread_safe.h 69.87% <12.50%> (-1.68%) ⬇️
src/platform/linux/pipewire.cpp 0.16% <0.00%> (-0.01%) ⬇️

... and 64 files with indirect coverage changes

@ReenigneArcher

Member

Does this resolve #4943 then? And, do you want to merge this or wait for tester feedback (it's unclear from the PR description)?

@ReenigneArcher ReenigneArcher added this to the pipewire milestone Jun 5, 2026
@Kishi85

Kishi85 commented Jun 5, 2026

Contributor Author

It's the fix for the second issue that I've identified in #4943 after commenting out the frame flush (still waiting on feedback in #4943 for that). The video stream freeze issue can occur independently although I'm unsure about the root cause as it's some kind of race condition with an extremely tight timing. It would probably be good to have someone else look at this if possible but I'd also rather not delay this too long in case it also helps with #5241. Maybe @psyke83 could check this if his time permits? By itself this shouldn't break anything as it is only adding 100us to the display switch capture loop return (and only to that specific case).

I can also add the removal of the frame flush to this PR depending on the final verdict in #4943 if you want to wait for that.

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch 2 times, most recently from f4d7ea3 to db9699b Compare June 6, 2026 07:04
@Kishi85

Kishi85 commented Jun 6, 2026

Contributor Author

To speed this up I've removed the logic order change for the capture timeout case, as that was the part I was really unsure about and the timeout case works as it is now. If someone ever finds and fixes the root cause, or has a better solution to avoid the video stream freeze, the workaround can and should be replaced or removed.

The way it is now I'm good with merging the change.

Fully handling #4943 still needs a fix for the FFmpeg segfault; for that see #5257. Otherwise it needs to be fixed in upstream FFmpeg.

@psyke83

psyke83 commented Jun 6, 2026

Contributor

You mention that the issue can't be reproduced when Sunshine debug is active, but have you tried checking (only) the pipewire debug output for any clues when this happens? Adding PIPEWIRE_DEBUG=4 into Sunshine's runtime environment should do the trick.

@Kishi85

Kishi85 commented Jun 7, 2026

Contributor Author

Pipewire debug output (although it is a lot) shows the following right before the stream freezes:

...
Jun 07 08:39:37 desktop sunshine[11991]: D pw.stream [stream.c:424:stream_set_state]: 0x7f95a8000fe0: update state from paused -> streaming (0) (null)
Jun 07 08:39:37 desktop sunshine[11991]: D spa.videoadapter [videoadapter.c:1155:impl_node_send_command]: 0x7f95a8009d08: started
Jun 07 08:39:37 desktop sunshine[11991]: D pw.node [impl-node.c:2748:on_state_complete]: 0x7f95a811dcc0: state complete res:0 seq:0
Jun 07 08:39:37 desktop sunshine[11991]: D pw.node [impl-node.c:459:node_update_state]: 0x7f95a811dcc0: start node driving:0 driver:0 prepared:0
Jun 07 08:39:37 desktop sunshine[11991]: D pw.node [impl-node.c:148:activate_target]: 0x7f95a811dcc0: target state:0x7f964c003008 id:112 pending:0/0 1:0:1
Jun 07 08:39:37 desktop sunshine[11991]: D pw.node [impl-node.c:493:node_update_state]: 0x7f95a811dcc0: (sunshine) suspended -> running ((null))
Jun 07 08:39:37 desktop sunshine[11991]: I pw.node [impl-node.c:503:node_update_state]: (sunshine-137) suspended -> running
Jun 07 08:39:37 desktop sunshine[11991]: D mod.client-node [remote-node.c:1009:node_info_changed]: info changed 0x7f95a800b7e8
Jun 07 08:39:37 desktop sunshine[11991]: D mod.protocol-native [connection.c:280:clear_buffer]: 0x7f95a8120b60 clear fds:0 n_fds:3
Jun 07 08:39:37 desktop sunshine[11991]: D pw.core [core.c:41:core_event_ping]: 0x7f95a80b3420: object 2 ping 1073741865
Jun 07 08:39:37 desktop sunshine[11991]: radv: RADV_PERFTEST=video_encode is deprecated and will be removed in future Mesa releases. Please use RADV_EXPERIMENTAL=video_encode instead.
Jun 07 08:39:37 desktop sunshine[11991]: WARNING: radv is not a conformant Vulkan implementation, testing use only.
Jun 07 08:39:37 desktop sunshine[11991]: [2026-06-07 08:39:37.270]: Info: Streaming bitrate is 30988000
Jun 07 08:39:37 desktop sunshine[11991]: [2026-06-07 08:39:37.272]: Info: Vulkan encode using GPU: AMD Radeon RX 9070 XT (RADV GFX1201)
Jun 07 08:39:37 desktop sunshine[11991]: [2026-06-07 08:39:37.272]: Info: Minimum FPS target set to ~30fps (33.3333ms)

If I'm reading this correctly, pipewire should already be streaming (first line), but whatever happens in Sunshine's video capture loop gets stuck without that tiny delay (audio still plays; further display switching breaks, likely because push_captured_image_cb is no longer called) and the encoder is not even created properly.

As already noted I've not found a way to properly debug this yet as even the debugger (or debug logging) throws the timing off enough to not trigger the issue.

With this PR applied I'm getting the same Pipewire debug output but this time the encoder properly creates and starts up:

Jun 07 08:49:49 desktop sunshine[19868]: D pw.stream [stream.c:424:stream_set_state]: 0x7f83200fcf00: update state from paused -> streaming (0) (null)
Jun 07 08:49:49 desktop sunshine[19868]: D spa.videoadapter [videoadapter.c:1155:impl_node_send_command]: 0x7f8320127a68: started
Jun 07 08:49:49 desktop sunshine[19868]: D pw.node [impl-node.c:2748:on_state_complete]: 0x7f83201322b0: state complete res:0 seq:0
Jun 07 08:49:49 desktop sunshine[19868]: D pw.node [impl-node.c:459:node_update_state]: 0x7f83201322b0: start node driving:0 driver:0 prepared:0
Jun 07 08:49:49 desktop sunshine[19868]: D pw.node [impl-node.c:148:activate_target]: 0x7f83201322b0: target state:0x7f83c00b3008 id:113 pending:0/0 1:0:1
Jun 07 08:49:49 desktop sunshine[19868]: D pw.node [impl-node.c:493:node_update_state]: 0x7f83201322b0: (sunshine) suspended -> running ((null))
Jun 07 08:49:49 desktop sunshine[19868]: I pw.node [impl-node.c:503:node_update_state]: (sunshine-138) suspended -> running
Jun 07 08:49:49 desktop sunshine[19868]: D mod.client-node [remote-node.c:1009:node_info_changed]: info changed 0x7f83200192e8
Jun 07 08:49:49 desktop sunshine[19868]: D mod.protocol-native [connection.c:280:clear_buffer]: 0x7f8320103bd0 clear fds:0 n_fds:3
Jun 07 08:49:49 desktop sunshine[19868]: D pw.core [core.c:41:core_event_ping]: 0x7f83200a0920: object 2 ping 1073741865
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.145]: Info: Creating encoder [hevc_vulkan]
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.145]: Info: Color coding: SDR (Rec. 709)
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.145]: Info: Color depth: 8-bit
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.145]: Info: Color range: JPEG
Jun 07 08:49:49 desktop sunshine[19868]: radv: RADV_PERFTEST=video_encode is deprecated and will be removed in future Mesa releases. Please use RADV_EXPERIMENTAL=video_encode instead.
Jun 07 08:49:49 desktop sunshine[19868]: WARNING: radv is not a conformant Vulkan implementation, testing use only.
Jun 07 08:49:49 desktop sunshine[19868]: radv: RADV_PERFTEST=video_encode is deprecated and will be removed in future Mesa releases. Please use RADV_EXPERIMENTAL=video_encode instead.
Jun 07 08:49:49 desktop sunshine[19868]: WARNING: radv is not a conformant Vulkan implementation, testing use only.
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.272]: Info: Streaming bitrate is 30988000
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.274]: Info: Vulkan encode using GPU: AMD Radeon RX 9070 XT (RADV GFX1201)
Jun 07 08:49:49 desktop sunshine[19868]: [2026-06-07 08:49:49.274]: Info: Minimum FPS target set to ~30fps (33.3333ms)

The only difference in the pipewire debug output, which I've seen once afterwards (and only because I was button mashing CTRL+ALT+SHIFT+F1 as hard as possible), was:

Jun 07 08:49:32 desktop sunshine[19868]: I pw.node [impl-node.c:1577:node_on_fd_events]: (sunshine-112) client missed 1 wakeups

The freeze only happens on portal-/kwingrab, not on kmsgrab and I've also tested both vulkan and vaapi encoders to make sure it's not related to them.

I'm not fully satisfied with adding a microsleep to work around this issue, but so far it's the best thing I could come up with and the only thing that has proven to work reliably. Therefore I've marked this as a workaround in the code until a better solution is found.

@psyke83

psyke83 commented Jun 7, 2026

Contributor

Can you see if the issue can reproduce with active content on the host? Leave vrrtest running, for example. When troubleshooting mode change hangs there was one specific case where a hang occurred only when changing modes on an idle desktop.

@Kishi85

Kishi85 commented Jun 7, 2026

Contributor Author

Also happens on non-idle stream. Tested with running a YouTube video. That's where I noted audio continuing despite the video stream freeze.

@Kishi85

Kishi85 commented Jun 7, 2026

Contributor Author

To clarify: The issue does not occur on every display switch but has a noticeable chance of occurring (roughly 3 in 10 switches). It can also be forced by button mashing any display switch button repeatedly (Note: you might trigger #4943 as well without the proper fix in place).

@Kishi85

Kishi85 commented Jun 12, 2026

Contributor Author

After looking into this I've unfortunately still not come up with a better solution so far.

That the workaround functions the way it is implemented hints at the code path for !push_captured_image_cb cases being the only thing affected. It is triggered at 2 points:

  • Capture context stopped and display-switch events (Lines 1417/1426):

    Sunshine/src/video.cpp

    Lines 1401 to 1429 in ea88d71

    auto push_captured_image_callback = [&](std::shared_ptr<platf::img_t> &&img, bool frame_captured) -> bool {
      KITTY_WHILE_LOOP(auto capture_ctx = std::begin(capture_ctxs), capture_ctx != std::end(capture_ctxs), {
        if (!capture_ctx->images->running()) {
          capture_ctx = capture_ctxs.erase(capture_ctx);
          continue;
        }
        if (frame_captured) {
          capture_ctx->images->raise(img);
        }
        ++capture_ctx;
      })
      if (!capture_ctx_queue->running()) {
        return false;
      }
      while (capture_ctx_queue->peek()) {
        capture_ctxs.emplace_back(std::move(*capture_ctx_queue->pop()));
      }
      if (switch_display_event->peek()) {
        artificial_reinit = true;
        return false;
      }
      return true;
  • Encode session errors (Lines 2331/2339):

    Sunshine/src/video.cpp

    Lines 2327 to 2340 in ea88d71

    auto push_captured_image_callback = [&](std::shared_ptr<platf::img_t> &&img, bool frame_captured) -> bool {
      while (encode_session_ctx_queue.peek()) {
        auto encode_session_ctx = encode_session_ctx_queue.pop();
        if (!encode_session_ctx) {
          return false;
        }
        synced_session_ctxs.emplace_back(std::make_unique<sync_session_ctx_t>(std::move(*encode_session_ctx)));
        auto encode_session = make_synced_session(disp.get(), encoder, *img, *synced_session_ctxs.back());
        if (!encode_session) {
          ec = platf::capture_e::error;
          return false;
        }

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch from db9699b to 0931ebc Compare June 12, 2026 06:21
@ReenigneArcher ReenigneArcher force-pushed the workaround-video-stream-freeze-on-display-switch branch from 0931ebc to 4ac5bf0 Compare June 12, 2026 17:13
@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch 3 times, most recently from e4d1455 to be6f953 Compare June 17, 2026 14:16
@Kishi85

Kishi85 commented Jun 17, 2026

Contributor Author

@psyke83 after doing more digging I've noted we're stopping the pw_thread_loop twice. Could you check if 5de0ca7 is the proper fix for this? I'm still seeing hangs when switching without the added delay unfortunately.

I've also found out that having more than one client connected adds enough delay to throw this race condition off. I've reduced the workaround delay by another order of magnitude (to 10us) and still cannot get hangs that way (at least so far), but can force them easily without an added delay.

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch from be6f953 to 37b59df Compare June 17, 2026 14:28
@Kishi85

Kishi85 commented Jun 23, 2026

Contributor Author

I've finally managed to attach a debugger after the issue had already occurred, and the video::captureThread seems to be stuck on a condition variable in thread_safe.h:50:

status_t pop() {
  std::unique_lock ul {_lock};
  if (!_continue) {
    return util::false_v<status_t>;
  }
  while (!_status) {
    _cv.wait(ul);
    if (!_continue) {
      return util::false_v<status_t>;
    }
  }
  auto val = std::move(_status);
  _status = util::false_v<status_t>;
  return val;
}

EDIT: It could also be that the while loop is not exiting here, as _status is NULL according to the debugger while _continue remains true.

The next (and only remaining) function in the call stack for the capture thread is video.cpp:1458:

Sunshine/src/video.cpp

Lines 1450 to 1469 in 8a0525a

// display_wp is modified in this thread only
// Wait for the other shared_ptr's of display to be destroyed.
// New displays will only be created in this thread.
while (display_wp->use_count() != 1) {
  // Free images that weren't consumed by the encoders. These can reference the display and prevent
  // the ref count from reaching 1. We do this here rather than on the encoder thread to avoid race
  // conditions where the encoding loop might free a good frame after reinitializing if we capture
  // a new frame here before the encoder has finished reinitializing.
  KITTY_WHILE_LOOP(auto capture_ctx = std::begin(capture_ctxs), capture_ctx != std::end(capture_ctxs), {
    if (!capture_ctx->images->running()) {
      capture_ctx = capture_ctxs.erase(capture_ctx);
      continue;
    }
    while (capture_ctx->images->peek()) {
      capture_ctx->images->pop();
    }
    ++capture_ctx;
  });

Meanwhile the session::video thread is stuck in the reinit_event section waiting for the capture thread to pick back up:

Sunshine/src/video.cpp

Lines 2489 to 2494 in 8a0525a

while (!shutdown_event->peek() && images->running()) {
  // Wait for the main capture event when the display is being reinitialized
  if (ref->reinit_event.peek()) {
    std::this_thread::sleep_for(20ms);
    continue;
  }

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch 2 times, most recently from eb8607d to 71dd734 Compare June 23, 2026 19:37
@Kishi85

Kishi85 commented Jun 23, 2026

Contributor Author

With the debugger results I've updated the fix/workaround for this issue to do one pull_free_image_cb (similar to what is done for timeout detection) which seems to keep the capture_ctx free image part going.

I'm unsure if this just works due to the added runtime or if it fixes the root cause properly.

Unfortunately, this doesn't work, so I've reverted back to the sleep workaround.

@psyke83 I'd be grateful if you could have a look at the debugger results if your time permits? Hopefully you have a deeper understanding of what's really going on (or rather wrong) here.

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch from 71dd734 to a0c106d Compare June 23, 2026 19:48
@Kishi85

Kishi85 commented Jun 24, 2026

Contributor Author

Digging into this some more, I've noticed that src/thread_safe.h was changed in #4967 in a way that might explain the hanging:

  • pop(std::chrono::duration<Rep, Period> delay) was changed to reset the status value at the end
  • Meanwhile peek(), pop(), view() and view(std::chrono::duration<Rep, Period> delay) have not been updated

With _status.reset() the resulting nullptr would make the while (!_status) loops of the untouched functions run indefinitely if I'm understanding this correctly.

@psyke83 Since this was your change, could you advise on the best course of action? I've tried a fix with ac5b4e9 but I'm unsure if that is enough or if it breaks something else.

EDIT: Added peek() to the not updated list

EDIT2: ac5b4e9 is unfortunately not enough to fix this as I can still trigger the issue (although not as often as before)

EDIT3: Trying with fully reverted thread_safe.h next

EDIT4: Issue is still occurring with fully reverted thread_safe.h so it's more likely a proper fix has to be done in a different way but so far I'm a bit clueless as to how that would be done. I'll revert this back to the sleep workaround stage for now but the capture thread is always getting stuck at the same point in thread_safe.h:50 so far.

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch from b212fa3 to a0c106d Compare June 24, 2026 14:58
@psyke83

psyke83 commented Jun 25, 2026

Contributor

I'm trying to get caught up; your proposed fix for the crash in #5257 works for me, but only for VAAPI; Vulkan still crashes. So to test this PR I'm keeping #5257 applied, but sticking to VAAPI and kwingrab. Additionally, to eliminate variables during testing, I'm keeping a glxgears window running that's intersecting on both screens to rule out issues with pipewire genuinely not sending buffers due to an idle screen. I'm also making sure to leave the systray enabled, as I recall that was causing odd issues with previous bugs related to pipewire capture.

With the above noted, I'm able to reproduce the freezing which is definitely ameliorated with this PR. If I revert a0c106d but keep the pw_thread_loop commit, I can once again reproduce the freezing, but unlike what you've reported, it does seem to resolve the freezing if I revert my recent changes to pop() in thread_safe.h.

This is not related to this PR directly, but Vulkan seems to be unstable with both display and mode switching on my system on master branch, with both of your PRs, or even with the thread_safe.h changes reverted. I wrote a stress script for KDE:

#!/bin/bash

OUTPUT="1" # Change this to your output ID from kscreen-doctor -o
RES_A="1920x1080@60"
RES_B="3840x2160@60"
TIME=$1

if [[ -z "$TIME" ]]; then
  TIME=5
fi

for i in {1..100}
do
   echo "Iteration $i: Switching to $RES_B"
   kscreen-doctor output.$OUTPUT.mode.$RES_B
   sleep $TIME
   echo "Iteration $i: Switching to $RES_A"
   kscreen-doctor output.$OUTPUT.mode.$RES_A
   sleep $TIME
done

echo "Stress test complete."

With a timeout smaller than 5 this almost immediately triggers a crash with Vulkan, but VAAPI survives 100 cycles even with 0 sleep. I would suggest checking to see if mode changes cause crashing in Vulkan on your system; if so, Vulkan may need separate fixes that are interfering with your testing of display switching freezes in this PR and crashes in the other PR.

@Kishi85

Kishi85 commented Jun 25, 2026

Contributor Author

@psyke83 Thanks for looking into this and also for working on/creating a reliable reproducer.

I probably shouldn't have put 4 edits in my last comment (didn't want to make too many separate small comments) but I've also realized that reverting thread_safe.h does indeed not fix this issue. I've just not managed to trigger it when writing my initial comment. The freeze however is related to an indefinite wait in thread_safe.h:50 so at least we know where to start. I'm tending to think it might be a synchronization issue due to peek() not utilizing the lock:

Sunshine/src/thread_safe.h

Lines 116 to 118 in 0731729

bool peek() {
  return _continue && (bool) _status;
}

Logically, though, it would make more sense for the whole free-images while block to be locked atomically (so that peek() returning true guarantees pop() has something to pop), but implementing that properly might entail a new function in thread_safe.h:

Sunshine/src/video.cpp

Lines 1464 to 1466 in 0731729

while (capture_ctx->images->peek()) {
  capture_ctx->images->pop();
}
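
A minimal sketch of what such a function could look like, assuming a simplified single-slot event type. The names slot_t and discard_pending() are hypothetical and not part of thread_safe.h; the point is only that the check and the removal happen under one lock and never wait:

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical, simplified stand-in for a single-value event;
// this is NOT the real type from thread_safe.h.
template<class T>
class slot_t {
public:
  // Store a value, replacing any pending one.
  void raise(T value) {
    std::lock_guard<std::mutex> lg {_lock};
    _status = std::make_shared<T>(std::move(value));
  }

  // Check and removal happen under the same lock, so no other consumer can
  // empty the slot in between. Returns true if a pending value was dropped.
  // Never blocks waiting for a value.
  bool discard_pending() {
    std::lock_guard<std::mutex> lg {_lock};
    if (!_status) {
      return false;
    }
    _status.reset();
    return true;
  }

private:
  std::mutex _lock;
  std::shared_ptr<T> _status;
};
```

If the real event type gained an equivalent method, the free-images loop could shrink to a single non-blocking call per capture context; whether that addresses the root cause is a separate question.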

As for the Vulkan crash, I've managed to hit that once so far (after trying to add a lock_guard to peek()), so it's good to know that it exists, and I agree that it might be another issue that needs fixing. It's another crash issue (like #5257), so it's easily distinguishable from the freeze issue, where the session does not die instantly or get shut down after 10 seconds but continues to run indefinitely (until you disconnect/close the session, at which point hang detection kicks in).

EDIT: I've looked through my logs and can see that the vulkan crash might be a double-free issue due to a race condition as I'm seeing the following on top of the related coredump:

Stack trace of thread 139838:
                                                  #0  0x00005650d89c016f ff_vk_free_buf (sunshine + 0x63216f)
                                                  #1  0x00005650d89c023d free_data_buf (sunshine + 0x63223d)
                                                  #2  0x00005650d8a89d07 buffer_pool_flush (sunshine + 0x6fbd07)
                                                  #3  0x00005650d8a89f16 buffer_replace (sunshine + 0x6fbf16)
                                                  #4  0x00005650d896e1ce av_packet_unref (sunshine + 0x5e01ce)
                                                  #5  0x00005650d896e23a av_packet_free (sunshine + 0x5e023a)
                                                  #6  0x00005650d876a362 _ZN5video18packet_raw_avcodecD2Ev (sunshine + 0x3dc362)
                                                  #7  0x00005650d876a38a _ZN5video18packet_raw_avcodecD0Ev (sunshine + 0x3dc38a)
                                                  #8  0x00005650d8737106 _ZNKSt14default_deleteIN5video12packet_raw_tEEclEPS1_ (sunshine + 0x3a9106)
                                                  #9  0x00005650d87302cf _ZNSt10unique_ptrIN5video12packet_raw_tESt14default_deleteIS1_EED2Ev (sunshine + 0x3a22cf)
                                                  #10 0x00005650d871f8ce _ZN6stream20videoBroadcastThreadERN5boost4asio21basic_datagram_socketINS1_2ip3udpENS1_15any_io_executorEEE (sunshine + 0x3918ce)
                                                  #11 0x00005650d874ffde _ZSt13__invoke_implIvPFvRN5boost4asio21basic_datagram_socketINS1_2ip3udpENS1_15any_io_executorEEEEJSt17reference_wrapperIS6_EEET_St14__invoke_otherOT0_DpOT1_ (sun>
                                                  #12 0x00005650d874fcd3 _ZSt8__invokeIPFvRN5boost4asio21basic_datagram_socketINS1_2ip3udpENS1_15any_io_executorEEEEJSt17reference_wrapperIS6_EEENSt15__invoke_resultIT_JDpT0_EE4typeEOSD_D>
                                                  #13 0x00005650d874f8dd _ZNSt6thread8_InvokerISt5tupleIJPFvRN5boost4asio21basic_datagram_socketINS3_2ip3udpENS3_15any_io_executorEEEESt17reference_wrapperIS8_EEEE9_M_invokeIJLm0ELm1EEEEv>
                                                  #14 0x00005650d874f6e0 _ZNSt6thread8_InvokerISt5tupleIJPFvRN5boost4asio21basic_datagram_socketINS3_2ip3udpENS3_15any_io_executorEEEESt17reference_wrapperIS8_EEEEclEv (sunshine + 0x3c16e>
                                                  #15 0x00005650d874f580 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvRN5boost4asio21basic_datagram_socketINS4_2ip3udpENS4_15any_io_executorEEEESt17reference_wrapperIS9_EEEEEE6_M_r>
                                                  #16 0x00005650d9925d56 execute_native_thread_routine (sunshine + 0x1597d56)
                                                  #17 0x00007fc04b2ac6ec n/a (libc.so.6 + 0xac6ec)
                                                  #18 0x00007fc04b3477bc n/a (libc.so.6 + 0x1477bc)

EDIT2: The Vulkan crash is in av_packet_free() as called by Sunshine, but that function can be called with a nullptr, in which case it is just a no-op. The Vulkan crash might therefore be a bug in FFmpeg.

@Kishi85

Kishi85 commented Jun 25, 2026

Contributor Author

Sunshine/src/video.cpp

Lines 1464 to 1466 in 0731729

while (capture_ctx->images->peek()) {
  capture_ctx->images->pop();
}

Looking at this section more, it might make sense to just use a pop with a delay/timeout here as it should still keep going while peek is true but should never wait infinitely.
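
A sketch of the bounded variant, written against a simplified stand-in rather than the real thread_safe.h type (which, as noted earlier in this thread, already has a pop overload taking a std::chrono::duration). slot_t and drain_bounded() are hypothetical names used for illustration only:

```cpp
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical, simplified stand-in for a single-value event;
// this is NOT the real type from thread_safe.h.
template<class T>
class slot_t {
public:
  void raise(T value) {
    {
      std::lock_guard<std::mutex> lg {_lock};
      _status = std::make_shared<T>(std::move(value));
    }
    _cv.notify_all();
  }

  // Waits at most `delay` for a value. Returns nullptr on timeout instead of
  // blocking forever when the slot was emptied by someone else.
  template<class Rep, class Period>
  std::shared_ptr<T> pop(std::chrono::duration<Rep, Period> delay) {
    std::unique_lock<std::mutex> ul {_lock};
    if (!_cv.wait_for(ul, delay, [this] { return static_cast<bool>(_status); })) {
      return nullptr;
    }
    auto val = std::move(_status);
    _status.reset();
    return val;
  }

private:
  std::mutex _lock;
  std::condition_variable _cv;
  std::shared_ptr<T> _status;
};

// Drops pending values using bounded waits; returns how many were dropped.
template<class T>
int drain_bounded(slot_t<T> &slot, std::chrono::milliseconds delay) {
  int dropped = 0;
  while (slot.pop(delay)) {
    ++dropped;
  }
  return dropped;
}
```

Note that every drain pays the timeout once at the end, when the slot is finally empty, so the delay would have to be small relative to the frame interval.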

@psyke83

psyke83 commented Jun 25, 2026

Contributor

I think you've identified the source of the freeze, but isn't that also papering over the issue with a timeout? There seems to be a deadlock in the unbounded peek() -> pop() calls that's specific to pipewire, and seems likelier to occur with my changes to the unbounded pop() function (but you noted that reverting it didn't fix it on your system).

The peek() -> pop() cycle may be causing issues when multiple callers/consumers cause a peek() to return true although there is nothing left for pop() to take (perhaps due to the async capture queue?). Although the peek() -> pop() call style is present in multiple parts of the code, can you try this diff, focused specifically on the problematic part? (full diff vs this PR + #5257):

diff --git a/src/platform/linux/pipewire.cpp b/src/platform/linux/pipewire.cpp
index 6deb9464..0ed5198c 100644
--- a/src/platform/linux/pipewire.cpp
+++ b/src/platform/linux/pipewire.cpp
@@ -882,14 +882,14 @@ namespace pipewire {
             }
             if (!push_captured_image_cb(std::move(img_out), false)) {
               BOOST_LOG(debug) << "[pipewire] PipeWire: timeout -> !push_captured_image_cb -> ok";
-              std::this_thread::sleep_for(10us);  // Workaround: delay OK to avoid video stream freezing
+              // std::this_thread::sleep_for(10us);  // Workaround: delay OK to avoid video stream freezing
               return platf::capture_e::ok;
             }
             break;
           case platf::capture_e::ok:
             if (!push_captured_image_cb(std::move(img_out), true)) {
               BOOST_LOG(debug) << "[pipewire] PipeWire: ok -> !push_captured_image_cb -> ok";
-              std::this_thread::sleep_for(10us);  // Workaround: delay OK to avoid video stream freezing
+              // std::this_thread::sleep_for(10us);  // Workaround: delay OK to avoid video stream freezing
               return platf::capture_e::ok;
             }
             break;
diff --git a/src/thread_safe.h b/src/thread_safe.h
index 4f2448f1..3274e1c4 100644
--- a/src/thread_safe.h
+++ b/src/thread_safe.h
@@ -38,6 +38,16 @@ namespace safe {
       _cv.notify_all();
     }
 
+    status_t try_pop() {
+      std::lock_guard lg {_lock};
+      if (!_status) {
+        return util::false_v<status_t>;
+      }
+      auto val = std::move(_status);
+      _status = util::false_v<status_t>;
+      return val;
+    }
+
     // pop and view should not be used interchangeably
     status_t pop() {
       std::unique_lock ul {_lock};
diff --git a/src/video.cpp b/src/video.cpp
index ae7050ab..bbefed6c 100644
--- a/src/video.cpp
+++ b/src/video.cpp
@@ -1461,8 +1461,7 @@ namespace video {
                   continue;
                 }
 
-                while (capture_ctx->images->peek()) {
-                  capture_ctx->images->pop();
+                while (capture_ctx->images->try_pop()) {
                 }
 
                 ++capture_ctx;

I can't reproduce any freeze with this change, but I'm unsure if there's a more appropriate fix to the issue. If this looks like the right track, we might want to evaluate other instances of unbounded peek() -> pop() calls that may also be susceptible to deadlocking. A quick grep makes me wonder if stream.cpp might benefit from this type of change as well.

$ git grep -nC1 "pop()" | grep -C1 "peek()"
--
src/input.cpp-463-    if (touch_port_event->peek()) {
src/input.cpp:464:      touch_port = *touch_port_event->pop();
--
--
src/stream.cpp-1117-            while (feedback_queue->peek()) {
src/stream.cpp:1118:              auto feedback_msg = feedback_queue->pop();
--
--
src/stream.cpp-1124-            while (session->control.peer && hdr_queue->peek()) {
src/stream.cpp:1125:              auto hdr_info = hdr_queue->pop();
--
--
src/stream.cpp-1197-      while (message_queue_queue->peek()) {
src/stream.cpp:1198:        auto message_queue_opt = message_queue_queue->pop();
--
src/stream.cpp:1297:    while (auto packet = packets->pop()) {
src/stream.cpp-1298-      if (shutdown_event->peek()) {
--
--
src/stream.cpp:1620:    while (auto packet = packets->pop()) {
src/stream.cpp-1621-      if (shutdown_event->peek()) {
--
--
--
src/video.cpp-1420-        while (capture_ctx_queue->peek()) {
src/video.cpp:1421:          capture_ctxs.emplace_back(std::move(*capture_ctx_queue->pop()));
--
--
src/video.cpp-1482-              if (switch_display_event->peek()) {
src/video.cpp:1483:                display_p = std::clamp(*switch_display_event->pop(), 0, (int) display_names.size() - 1);
--
--
src/video.cpp-2302-      if (switch_display_event->peek()) {
src/video.cpp:2303:        display_p = std::clamp(*switch_display_event->pop(), 0, (int) display_names.size() - 1);
--
--
src/video.cpp-2335-        while (encode_session_ctx_queue.peek()) {
src/video.cpp:2336:          auto encode_session_ctx = encode_session_ctx_queue.pop();
--
tests/unit/test_audio.cpp:55:    while (const auto packet = packets->pop()) {
tests/unit/test_audio.cpp-56-      if (shutdown_event->peek()) {
--
tests/unit/test_http_pairing.cpp-118-    ASSERT_EQ(add_cert->peek(), true);
tests/unit/test_http_pairing.cpp:119:    auto cert = add_cert->pop();

Edit: to clarify, I tried adding locking to the peek() function before trying the "try_pop()" strategy, but it didn't seem to help resolve the deadlock. I am curious why you were still seeing deadlocks with the reverted pop() function but I wasn't; perhaps I didn't test enough to reproduce the issue.
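To make the try_pop() idea from the diff above easier to follow in isolation, here is a minimal sketch. It is not Sunshine's safe::event_t: slot_event and drain are hypothetical names, and std::optional stands in for status_t. The point it illustrates is that the emptiness check and the removal happen under a single lock, so the drain loop can never block.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical single-slot event, loosely modeled on the diff above.
// Not Sunshine's real safe::event_t; std::optional stands in for status_t.
template<class T>
class slot_event {
public:
  void raise(T value) {
    std::lock_guard lg {_lock};
    _status = std::move(value);
    _cv.notify_all();
  }

  // Snapshot only: the answer may be stale by the time the caller acts on it.
  bool peek() {
    std::lock_guard lg {_lock};
    return _status.has_value();
  }

  // Unbounded pop: blocks until some producer raises a value.
  T pop() {
    std::unique_lock ul {_lock};
    _cv.wait(ul, [this] { return _status.has_value(); });
    T val = std::move(*_status);
    _status.reset();
    return val;
  }

  // Check and take under one lock; returns an empty optional instead of blocking.
  std::optional<T> try_pop() {
    std::lock_guard lg {_lock};
    return std::exchange(_status, std::nullopt);
  }

private:
  std::mutex _lock;
  std::condition_variable _cv;
  std::optional<T> _status;
};

// Drain without ever waiting; returns how many values were discarded.
template<class T>
int drain(slot_event<T> &ev) {
  int discarded = 0;
  while (ev.try_pop()) {
    ++discarded;
  }
  return discarded;
}
```

With this shape, a drain on an empty slot returns immediately, whereas `while (ev.peek()) { ev.pop(); }` would rely on the value still being there when pop() runs.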

@Kishi85

Kishi85 commented Jun 25, 2026

Contributor Author

> I think you've identified the source of the freeze, but isn't that also papering over the issue with a timeout? There seems to be a deadlock in the unbounded peek() -> pop() calls that's specific to pipewire, and seems likelier to occur with my changes to the unbounded pop() function (but you noted that reverting it didn't fix it on your system).
>
> The peek() -> pop() cycle may be causing issues due to multiple callers/consumers that causes a peek to return true but the pop() is already empty (perhaps due to the async capture queue?).

I agree that this is likely a combination of the async capture queue and pipewire's event-based capture which is causing this issue to surface now.

> Although the peek() -> pop() call style is present in multiple parts of the code, can you try this diff focused specifically on the problematic part? (full diff vs this PR + #5257):

> I can't reproduce any freeze with this change, but I'm unsure if there's a more appropriate fix to the issue. If this looks like the right track, we might want to evaluate other instances of unbounded peek() -> pop() calls that may also be susceptible to deadlocking. A quick grep makes me wonder if stream.cpp might benefit from this type of change as well.

What I was suggesting in my earlier comment was to do that part like this (I've tried it and it seems to work):

diff --git a/src/video.cpp b/src/video.cpp
index ae7050ab..5500fc06 100644
--- a/src/video.cpp
+++ b/src/video.cpp
@@ -1462,7 +1462,7 @@ namespace video {
                 }
 
                 while (capture_ctx->images->peek()) {
-                  capture_ctx->images->pop();
+                  capture_ctx->images->pop(20ms);
                 }
 
                 ++capture_ctx;

This does logically the same thing as what you're doing with try_pop(), but with an added 20ms timeout.
The question now is which one is the better solution. I like the simplicity of your try_pop(), so I'll likely just change the PR to that solution unless we find something even better.
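For comparison, a bounded pop could look roughly like the sketch below. The thread does not show the signature of Sunshine's pop(delay), so timed_slot and its members are hypothetical names and std::optional again stands in for status_t. It only illustrates the idea: wait at most `delay`, then return empty instead of blocking forever.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical single-slot event with a bounded pop.
// Not Sunshine's real API; names and types are assumptions for illustration.
template<class T>
class timed_slot {
public:
  void raise(T value) {
    std::lock_guard lg {_lock};
    _status = std::move(value);
    _cv.notify_all();
  }

  // Waits up to `delay` for a value; returns an empty optional on timeout.
  template<class Rep, class Period>
  std::optional<T> pop(std::chrono::duration<Rep, Period> delay) {
    std::unique_lock ul {_lock};
    if (!_cv.wait_for(ul, delay, [this] { return _status.has_value(); })) {
      return std::nullopt;  // nothing arrived in time
    }
    return std::exchange(_status, std::nullopt);
  }

private:
  std::mutex _lock;
  std::condition_variable _cv;
  std::optional<T> _status;
};
```

The trade-off against try_pop() is the worst case: if peek() was stale, a timed pop costs one timeout period before the loop condition is re-evaluated, while try_pop() returns right away.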

As for the other code paths that could potentially deadlock, I'm unsure whether they should be touched. The while (peek()) loops should not have any issues with try_pop(), but for the if (peek()) ones it might be better to use pop(timeout) so they still have time to return a proper result. I'm no expert on either of those usages:

$ git grep -nC1 "pop()" | grep -C1 "peek()"
--
src/input.cpp-463-    if (touch_port_event->peek()) {
src/input.cpp:464:      touch_port = *touch_port_event->pop();
--
--
src/stream.cpp-1117-            while (feedback_queue->peek()) {
src/stream.cpp:1118:              auto feedback_msg = feedback_queue->pop();
--
--
src/stream.cpp-1124-            while (session->control.peer && hdr_queue->peek()) {
src/stream.cpp:1125:              auto hdr_info = hdr_queue->pop();
--
--
src/stream.cpp-1197-      while (message_queue_queue->peek()) {
src/stream.cpp:1198:        auto message_queue_opt = message_queue_queue->pop();
--
src/stream.cpp:1297:    while (auto packet = packets->pop()) {
src/stream.cpp-1298-      if (shutdown_event->peek()) {
--
--
src/stream.cpp:1620:    while (auto packet = packets->pop()) {
src/stream.cpp-1621-      if (shutdown_event->peek()) {
--
--
--
src/video.cpp-1420-        while (capture_ctx_queue->peek()) {
src/video.cpp:1421:          capture_ctxs.emplace_back(std::move(*capture_ctx_queue->pop()));
--
--
src/video.cpp-1482-              if (switch_display_event->peek()) {
src/video.cpp:1483:                display_p = std::clamp(*switch_display_event->pop(), 0, (int) display_names.size() - 1);
--
--
src/video.cpp-2302-      if (switch_display_event->peek()) {
src/video.cpp:2303:        display_p = std::clamp(*switch_display_event->pop(), 0, (int) display_names.size() - 1);
--
--
src/video.cpp-2335-        while (encode_session_ctx_queue.peek()) {
src/video.cpp:2336:          auto encode_session_ctx = encode_session_ctx_queue.pop();
--
tests/unit/test_audio.cpp:55:    while (const auto packet = packets->pop()) {
tests/unit/test_audio.cpp-56-      if (shutdown_event->peek()) {
--
tests/unit/test_http_pairing.cpp-118-    ASSERT_EQ(add_cert->peek(), true);
tests/unit/test_http_pairing.cpp:119:    auto cert = add_cert->pop();

> Edit: to clarify, I tried adding locking to the peek() function before trying the "try_pop()" strategy, but it didn't seem to help resolve the deadlock. I am curious why you were still seeing deadlocks with the reverted pop() function but I wasn't; perhaps I didn't test enough to reproduce the issue.

I've had that happen as well when adding a lock to peek(), but I was eventually able to freeze it (I also managed to hit the Vulkan crash with this once). Logically, adding the lock to just peek() likely won't help much, because it's not an atomic lock over the whole peek() -> pop() sequence.
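A small single-threaded replay may help show why a lock inside peek() alone is not enough. This is only a sketch under the assumption that a second consumer can run between the check and the removal, which has not been confirmed as the actual cause here; locked_slot and stale_peek_replay are hypothetical names.

```cpp
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical slot where every operation takes the lock individually.
struct locked_slot {
  std::mutex lock;
  std::optional<int> value;

  bool peek() {
    std::lock_guard lg {lock};
    return value.has_value();
  }  // lock is released here, before the caller acts on the result

  std::optional<int> take() {
    std::lock_guard lg {lock};
    return std::exchange(value, std::nullopt);
  }
};

// Replays one unlucky interleaving of two consumers on a single thread.
// Returns true if consumer A ends up acting on a stale peek().
bool stale_peek_replay() {
  locked_slot slot;
  slot.value = 1;

  bool seen = slot.peek();    // consumer A: "there is a value"
  auto stolen = slot.take();  // consumer B runs in the gap and takes it
  auto got = slot.take();     // consumer A: slot is empty now

  // With an unbounded pop() in place of the last take(), A would wait here
  // until a producer raises another value.
  return seen && stolen.has_value() && !got.has_value();
}
```

Each call is individually thread-safe, yet the check-then-act pair is not, which is why folding both steps into one locked operation (or bounding the wait) sidesteps the problem.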

@psyke83

psyke83 commented Jun 25, 2026

Contributor

Yes, the while(peek()) { pop(); } type loops are the main candidates for this type of change. An arbitrary pop() timeout is also a reasonable solution, as long as we can agree that the "while peek() -> unbounded pop()" type loops are inherently unsafe as queue draining checks and should probably be changed.

It seems reasonable to assume that one of these loops explains #5241; in fact, if we assume that a reinit was triggered at the time of those freezes, then this exact call site is also the exact culprit.

Needless to say, this change didn't improve the display or mode switch crashing for Vulkan on my system, so you're likely correct that it's an ffmpeg bug. I don't recall it being so unstable during mode changes when I last tested some weeks/months ago.

@Kishi85

Kishi85 commented Jun 25, 2026

Contributor Author

> Yes, the while(peek()) { pop(); } type loops are the main candidates for this type of change. An arbitrary pop() timeout is also a reasonable solution, as long as we can agree that the "while peek() -> unbounded pop()" type loops are inherently unsafe as queue draining checks and should probably be changed.

Agreed, for queue draining that exact deadlock scenario is the problem. It's especially critical in this specific case because the loop drains the captured images from the capture context; if it gets stuck in _cv.wait() indefinitely, the video stream freezes. The other cases are likely not as critical but should be checked as well. I'm inclined to leave those changes for a separate PR (done by someone with more understanding of those call sites than me).

> It seems reasonable to assume that one of these loops explains #5241; in fact, if we assume that a reinit was triggered at the time of those freezes, then this exact call site is also the exact culprit.

That assumption is also reasonable as the push_captured_image_cb -> false can also be caused by encoder session errors, not only by display switching.

> Needless to say, this change didn't improve the display or mode switch crashing for Vulkan on my system, so you're likely correct that it's an ffmpeg bug. I don't recall it being so unstable during mode changes when I last tested some weeks/months ago.

I've only been able to trigger this issue once so far (on fully updated CachyOS with 9700XT) but that race condition might be dependent on multiple factors (FFmpeg/Mesa/Hardware) and should be debugged further in a separate issue/PR.

@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch from 823ce83 to e787fbc Compare June 25, 2026 09:06
@Kishi85 Kishi85 changed the title fix(linux/pipewire): avoid video stream freeze (on display switch) fix(video): fix video stream freeze (on pipewire display switch) Jun 25, 2026
@Kishi85 Kishi85 changed the title fix(video): fix video stream freeze (on pipewire display switch) fix(video): fix video stream freezing (on pipewire display switch) Jun 25, 2026
@Kishi85 Kishi85 changed the title fix(video): fix video stream freezing (on pipewire display switch) fix(video): fix video stream freezing on capture re-init (e.g. pipewire display switch) Jun 25, 2026
@ReenigneArcher ReenigneArcher force-pushed the workaround-video-stream-freeze-on-display-switch branch from e787fbc to 9c44762 Compare June 26, 2026 03:04
@ReenigneArcher

ReenigneArcher commented Jun 26, 2026

Member

@Kishi85 is this ready to merge?

Edit: I recently made a change forcing more doxygen documentation.

Generating docs for compound platf::pa::add_const_helper.../home/docs/checkouts/readthedocs.org/user_builds/sunshinestream/checkouts/5249/src/thread_safe.h:52: error: Member try_pop() (function) of class safe::event_t is not documented.

Kishi85 and others added 3 commits June 26, 2026 07:07
Before it was stopped at the beginning and end of the destructor.
Doing it once at the end is what makes sense logically.
This is a trial-and-error workaround to avoid video stream freezing when
display switching that somehow happens when not using the added thread
sleeps due to a race condition that needs to be further diagnosed.

Pipewire state change logging is also moved to the info log level to
improve initial issue analysis without enabling debug logging
…in video.cpp

Also remove sleep workaround from pipewire.cpp

Co-Authored-By: psyke83 <psyke83@users.noreply.github.com>
@Kishi85 Kishi85 force-pushed the workaround-video-stream-freeze-on-display-switch branch from 9c44762 to 1c97c44 Compare June 26, 2026 05:09
@Kishi85

Kishi85 commented Jun 26, 2026

Contributor Author

> @Kishi85 is this ready to merge?

Yes, it looks good to me and @psyke83. I've also been testing this for the last day without any issues. For the remaining code paths that have deadlock potential, I'll open an issue so that checking them isn't forgotten.

@psyke83 Would you open an issue (or PR) for the vulkan crash? I'm not experiencing this reliably enough to provide the necessary details for that.

> Edit: I recently made a change forcing more doxygen documentation.
>
> Generating docs for compound platf::pa::add_const_helper.../home/docs/checkouts/readthedocs.org/user_builds/sunshinestream/checkouts/5249/src/thread_safe.h:52: error: Member try_pop() (function) of class safe::event_t is not documented.

Added, always good to have more documentation (where appropriate).

@Kishi85

Kishi85 commented Jun 26, 2026

Contributor Author

Note: This will just fix one issue that can occur when display switching (especially with event-based methods like pipewire). Other still known issues:

@ReenigneArcher ReenigneArcher merged commit e40d355 into LizardByte:master Jun 26, 2026
70 of 71 checks passed

vindeckyy pushed a commit to vindeckyy/Solar-Flare that referenced this pull request Jun 27, 2026
…re display switch) (LizardByte#5249)

Co-authored-by: psyke83 <psyke83@users.noreply.github.com>
(cherry picked from commit e40d355)